ABSTRACT

In this paper we present the Clustering-Labels-Score Patterns Spotter (CLaSPS), a new methodol- 
ogy for the determination of correlations among astronomical observables in complex datasets, based 
on the application of distinct unsupervised clustering techniques. The novelty in CLaSPS is the cri- 
terion used for the selection of the optimal clusterings, based on a quantitative measure of the degree 
of correlation between the cluster memberships and the distribution of a set of observables, the labels, 
not employed for the clustering. CLaSPS has been primarily developed as a tool to tackle the
complexity of the massive multi-wavelength astronomical datasets produced by
the federation of the data from modern automated astronomical facilities. In this paper we discuss the
applications of CLaSPS to two simple astronomical datasets, both composed of extragalactic sources 
with photometric observations at different wavelengths from large area surveys. The first dataset,
CSC+, is composed of optical quasars spectroscopically selected in the SDSS data, observed in the
X-rays by Chandra and with multi-wavelength observations in the near-infrared, optical and ultraviolet
spectral intervals. One of the results of the application of CLaSPS to the CSC+ is the re-identification
of a well-known correlation between the α_OX parameter and the near-ultraviolet color, in a subset of
CSC+ sources with relatively small values of the near-ultraviolet colors. The other dataset consists
of a sample of blazars for which photometric observations in the optical, mid- and near-infrared are
available, complemented, for a subset of the sources, by Fermi γ-ray data. The main results of the
application of CLaSPS to such datasets have been the discovery of a strong correlation between the 
multi-wavelength color distribution of blazars and their optical spectral classification as BL Lacs or
Flat Spectrum Radio Quasars (FSRQs), and a peculiar pattern followed by blazars in the WISE
mid-infrared color space. This pattern and its physical interpretation have been discussed in detail in
other papers by one of the authors.

Subject headings: Methods: statistical - Catalogs - Surveys 



1. INTRODUCTION 

The advancement of discovery in astronomy, from the 
statistical point of view, can be described as the suc- 
cessful application of several distinct Knowledge Discov- 
ery (KD) techniques to increasingly larger data samples. 
These techniques include: the classification of sources ac- 
cording to one or more observational quantities; pattern 
recognition for the discovery of correlations among ob- 
servable quantities; outlier selection for highlighting rare 
and/or unknown sources; regression, for the estimation 
of derived empirical properties from observed quantities. 
The discovery of new or unexpected correlations between 
observable quantities at different wavelengths, for exam- 
ple, has propelled the understanding of the nature of 
astronomical sources and their physical modeling (see,
for example, the discovery of the fundamental plane of
elliptical galaxies (Djorgovski & Davis 1987), and the
discovery of the link of the galaxy X-ray emission with
different stellar populations (Fabbiano & Trinchieri 1985;
Fabbiano & Shapley 2002)).
The effectiveness of pattern recognition techniques for
the determination of correlations in low dimensional 
spaces (two or three dimensions) has usually relied on 
the ability of the astronomers to visualize the distribu- 
tion of data and make informed guesses about the nature 
of these patterns, based on theoretical models, reasonableness
and intuition. However, this approach becomes
more and more ineffectual with the increase in complexity
and size of the explored datasets. This difficulty has
led to the introduction of KD techniques in the astro- 
nomical context. Such techniques are based on statistical 
and computational methodologies capable of automati- 
cally identifying useful correlations among parameters in 
an N-dimensional dataset without any a priori assumption
on the nature of either the data or the sought-after patterns.
Using these techniques, the focus of the astronomer can 
shift to the definition of the general problem to be in- 
vestigated, the selection of the interesting patterns and 
their physical interpretation. In this paper, we present 
CLaSPS, a new methodology based on KD techniques 
for the exploration of complex and massive astronomical
datasets and the detection of correlations among observational
parameters. While CLaSPS is designed for
datasets containing a very large number of sources, it is
also well suited to handle small datasets, as will be shown 
in this paper. 

The adoption of KD methodologies in astronomy has 
only recently surged, due to the increasing availability of 
massive and complex datasets that would be almost intractable
if tackled with the knowledge extraction techniques
classically employed in astronomical research. A
review of the advantages and most interesting applications
of KD to astronomical problems can be found in
(Ball & Brunner 2010). The main reasons for the delay
in the adoption of such methods in astronomy are: a)
datasets for which KD has an edge over classical meth- 
ods (because of their size and complexity) have become 
frequent only in the last ~15 years; b) slow transition 
from model-driven to data-driven research; c) lack of in- 
terdisciplinary expertise required for the application of 
KD techniques. Other disciplines for which the problem 
of dealing with massive datasets arose earlier, instead, 
have seen a steadier and faster growth of the number and 
importance of the KD tools employed on a regular basis. 
For example, the study of financial markets and of complex
networks and systems (applied to the WWW, advertisement
placement, epidemiology, genetics, proteomics and
security) has been at the forefront of the application and
development of KD techniques. Thorough reviews of the
applications of KD methodologies to specific financial
topics, i.e. customer management and financial fraud
detection, can be found in (Ngai et al. 2009) and (Ngai
et al. 2011) respectively, while a general review of the
role of KD in bio-informatics is provided in (Natarajan
et al. 2005). Even if a certain degree of inter-disciplinary



expertise is desired, domain-specific knowhow is crucial 
to narrow down the types and number of techniques that 
can be used to address the specific problems encountered 
in each field, and to interpret correctly the results of 
the application of such techniques to the data. Furthermore,
KD is only one of the skills necessary to tackle
the new problems arising with the onset of data-driven
astronomy, the others being astrostatistics (e.g. Babu &
Feigelson 2007), visualization techniques (Comparato et
al. 2007; Way et al. 2011; Hassan & Fluke 2011) and advanced
signal processing (Scargle 2003; Protopapas et al.
2006). All these fields are currently the subject of a new
discipline: Astroinformatics (Borne et al. 2011).

In this paper, we have focused our attention on the 
broad question of how efficiently the physical nature 
of astronomical sources can be characterized by multi- 
wavelength photometric data. We have applied CLaSPS 
to two datasets representing specific cases where such an assumption
can be tested and verified. CLaSPS assumes
that low dimensional patterns in data are associated with
aggregations (clusters) in the structure of the data in the
high-dimensional "feature space" generated by all the observables
of the sources^1. These clusters are defined by
the degree of correlation between the distribution of fea- 
tures (i.e., the observables used to build the feature space 
where clusters have been selected) and a set of external 
quantities, usually observables, metadata or a priori con- 
straints that have not been used for clustering. 

^1 In general, any source with N measured observables can be
represented as a point in an N-dimensional feature space, where the
coordinates are the numerical values of the observables (or derived
quantities).

The CLaSPS method, based on KD techniques for
unsupervised clustering and on the use of external information
to label the cluster members, has been designed
to tackle the problem of the extraction of information
from two distinct classes of datasets:

a) Inhomogeneous large area datasets. The advancements in the Virtual
Observatory (VO) technology are facilitating the access
to datasets obtained by the combination of multiple observations
from different surveys with different observational
features (e.g., depth, spatial coverage and resolution,
spectral resolution). Such datasets are, by construction,
inherently incomplete and are affected by the inconsistency
of the observational features of each set of observations
used to create them. We expect these datasets
to grow in complexity as new data become available.
KD techniques can facilitate the extraction of the
knowledge contained in these "federated" inhomogeneous
samples.

b) Large homogeneous datasets from
multi-wavelength surveys of well-defined areas of the sky
observed with similar depths at different wavelengths.
These surveys typically yield large samples of sources,
complete to a given flux. These datasets span limited
but well characterized regions of the N-dimensional observable
feature space. The exploration of the structure
of the multi-dimensional distribution of sources in the
feature space may lead to the discovery of high dimensional
correlations and patterns in the data that have
been overlooked (or, simply, could not be established) in
lower dimensional studies.

This paper is organized as follows: in Sec. 2 we describe
the CLaSPS method, in Sec. 3 its application to
the CSC+ dataset, and in Sec. 4 its application to a sample
of blazars with multi-wavelength photometry available.
We discuss the future developments of CLaSPS in
Sec. 5.

2. CLASPS 

CLaSPS is based on well established data mining tech- 
niques for unsupervised clustering. These techniques 
search for spontaneous and inherent aggregations of data 
in the feature space generated by their observables.
Table 1 summarizes the terms that will be used below.
These techniques have been complemented by the use of
external data (labels). Labels are observables not used for
the clustering which can be used to characterize the content
of the set of clusters or of single clusters. D'Abrusco
et al. (2009) used these techniques for the selection of optical
candidate quasars from photometric datasets, employing
as a label the spectroscopic classification available
for a subset of the photometric sources. This method can
be extended to use multiple labels, both numerical (e.g., 
fluxes, magnitudes, colors) and categorial (spectral clas- 
sification flags, morphological types). From a method- 
ological standpoint, the two tools required for this KD 
methodology are: 

(a) one or more unsupervised clustering algorithms, to
determine multiple sets of partitions of the data
(the specific methods used in this paper are discussed
in Sec. 2.2);

(b) a quantitative measure of the degree of correlation
between cluster populations and the values of the
label associated with the members of the clusters
(see Sec. 2.3 for more details).



Once multiple clusterings^2 of a dataset in a given feature
space have been produced, the choice of the most
interesting partition of the dataset is performed considering
a quantitative evaluation of the degree of correlation
between the distribution of the label and the cluster
population in each clustering. Unlike most classical selection
criteria, which rely only on the intrinsic statistical
properties of the clusterings, our method selects clusterings
based on both the distributions of features and
of the associated labels. The degree of correlation between
features and labels can be generically expressed by
a numerical quantity (the "score", see Sec. 2.3) that can
be defined and calculated for every single clustering and
cluster.

^2 The term "clustering" will be used in this paper to indicate one
collection of clusters determined on any sample by any clustering
algorithm. Multiple clusterings determined on the same sample
can differ in several properties, namely the number of clusters,
the number of members of the clusters, etc.

Table 1

Definitions of the KD-related terms used in the paper.

Term                     Definition
Observation              An astronomical source as defined by a vector of its observables
Feature                  Any observable quantity of a given source used to determine a set of clusters
Feature space            An abstract N-dimensional space where each observation of a sample is represented as a point
Cluster                  A set of sources (or observations) aggregated by a generic clustering algorithm
Clustering               A set of clusters representing a complete partition of a sample of observations
Unsupervised clustering  In KD, unsupervised clustering refers to the techniques used to determine the spontaneous aggregations of sources not using examples
Label                    An observable of a set of observations used to label the members of the clusters (as defined in this paper)
Score                    A quantitative diagnostic of the correlation between cluster membership and the distribution of labels (as defined in this paper)



2.1. CLaSPS and cluster ensembles



The task of combining multiple clusterings into a single
partition is known, in the statistical/data mining literature,
as the search for the "consensus clustering". This
problem has been thoroughly studied, and is discussed
in several papers (e.g. Ghosh & Acharya (2011)). The
main reasons for the use of cluster ensemble techniques
are: the improvement of the quality of the clustering,
the increased robustness of the clustering and the ability to
combine "multiview" clusterings (i.e. clusterings obtained
with nonidentical sets of sources and/or features)
(Ghosh & Acharya 2011). CLaSPS, instead, selects the optimal
clustering(s) from the point of view of the astrophysical
interpretation of the correlations, according to the values
of the scores (Sec. 2.3). The scores are evaluated on
the basis of the clustering memberships and of a given partition
of the label, an external quantity not used to produce
the clusterings. For this reason, CLaSPS neither tries
to determine a "consensus clustering" nor attempts to
combine clusterings or improve the properties of each distinct
clustering produced by the unsupervised clustering (UC) methods used. All
the clusterings retain their own properties, biases and
weaknesses, which have to be taken into account when interpreting
the results of the application of CLaSPS. The
CLaSPS method could nonetheless be improved by the
application of cluster ensemble techniques, as discussed
in Sec. 5.

2.2. Unsupervised clustering 



In statistical terms, the cluster analysis of a sample is
the determination of a segmentation of the data in groups
or clusters, each group representing objects with similar
properties (Hastie et al. 2009; Hartigan 1975). The cluster
analysis depends on the definition of a "dissimilarity"
employed to assign the objects to different clusters. Usually,
the dissimilarity is evaluated on general attributes
(or features) of the objects. For example, the values of the
observed fluxes in a given filter represent one of the features
of an astronomical dataset. The pairwise dissimilarity
between the i-th and k-th observations, obtained by summing
the dissimilarities d_j evaluated on the values of each j-th
attribute, can be defined as:



$$D(x_i, x_k) = \sum_{j=1}^{p} d_j(x_{ij}, x_{kj}) \qquad (1)$$



For quantitative attributes, the pairwise dissimilarity can
be evaluated using the squared distance $d_j(x_{ij}, x_{kj}) = (x_{ij} - x_{kj})^2$.
The individual dissimilarities evaluated for
each attribute are then combined to produce a single
overall dissimilarity between objects. The goal of the
clustering algorithm is to partition the sources into clusters
so that the pairwise dissimilarities between objects
assigned to the same cluster are generally smaller than
those between objects in different clusters. In KD the term unsupervised
refers to learning algorithms that do not require an example
or a "teacher" to infer the properties of the probability
density associated with a given dataset (Hastie
et al. 2009). The use of unsupervised techniques is
relatively new in astronomy, while supervised learning
techniques are very common and are usually applied to
classification and regression problems. An early example
of an application of unsupervised clustering to the
problem of the classification of gamma-ray bursts can
be found in (Mukherjee et al. 1998). The estimation
of photometric redshifts for extragalactic sources, based
on the spectroscopic redshifts measured for a subset of
the sources, has been tackled with several distinct KD
methods, for example connectivity analysis (Freeman et
al. 2009), Gaussian processes (Bonfield et al. 2010; Way &
Srivastava 2006) and neural networks (Yeche et al.
2010; Collister & Lahav 2004). A further example of the
combined use of unsupervised clustering and supervised
learning techniques for photometric redshift estimation
can be found in (Laurino et al. 2011). In general, unsupervised
learning can be used to highlight the intrinsic
structure of the data and as an exploratory tool. For
some of these techniques, the only information that has
to be provided before the clustering is performed is the
final number of clusters. In the following subsections, we
will briefly describe the three unsupervised clustering
algorithms used in this work.
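
As a minimal sketch of the dissimilarity of Eq. (1), assuming quantitative attributes and the squared difference as the per-attribute dissimilarity, the overall pairwise dissimilarity could be computed as follows. The function names and the toy sample are illustrative assumptions and do not come from the CLaSPS implementation.

```python
def attribute_dissimilarity(a, b):
    """Squared difference d_j between two values of the same attribute."""
    return (a - b) ** 2


def pairwise_dissimilarity(x_i, x_k):
    """Overall dissimilarity D(x_i, x_k): sum of d_j over all attributes."""
    if len(x_i) != len(x_k):
        raise ValueError("observations must have the same number of attributes")
    return sum(attribute_dissimilarity(a, b) for a, b in zip(x_i, x_k))


def dissimilarity_matrix(observations):
    """Symmetric matrix of D for every pair of observations."""
    n = len(observations)
    return [[pairwise_dissimilarity(observations[i], observations[k])
             for k in range(n)] for i in range(n)]


# Hypothetical sample: three sources described by two colors each.
sample = [(0.0, 1.0), (3.0, 5.0), (0.0, 2.0)]
matrix = dissimilarity_matrix(sample)
```

A clustering algorithm only needs this matrix (or the function that generates it), which is why the choice of the dissimilarity can be changed without touching the rest of the procedure.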

2.2.1. K-means 



The K-means algorithm (Lloyd 1957) is one of the
most frequently used clustering methods. It is applicable
when the attributes are quantitative and the dissimilarity
measure is defined as the squared Euclidean distance:
$d(x_i, x_k) = \sum_{j=1}^{p} (x_{ij} - x_{kj})^2 = \|x_i - x_k\|^2$.
The N observations are associated with K clusters so that, in each
cluster, the average dissimilarity of the observations from
the cluster mean is minimized. The optimal clustering C* is defined as:

$$C^* = \min_{C} \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|x_i - \bar{x}_k\|^2 \qquad (2)$$

where $N_k$ is the number of observations assigned to
the k-th cluster and $\bar{x}_k$ is the mean of the observations assigned to it.
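
The minimization described above is usually carried out iteratively. The following is a minimal sketch, assuming the standard alternation of an assignment step and a mean-update step and a user-supplied set of initial means; the helper names, the stopping rule and the toy data are assumptions made for illustration only.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two observations."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of observations."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))


def k_means(observations, initial_means, max_iter=100):
    """Alternate assignment and mean-update steps until memberships stop changing."""
    means = [tuple(m) for m in initial_means]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each observation goes to the closest cluster mean.
        new_assignment = [
            min(range(len(means)), key=lambda k: squared_distance(x, means[k]))
            for x in observations
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: recompute the mean of every non-empty cluster.
        for k in range(len(means)):
            members = [x for x, c in zip(observations, assignment) if c == k]
            if members:
                means[k] = mean_point(members)
    return assignment, means


def objective(observations, assignment, means):
    """Sum over clusters of N_k times the summed squared distances from the mean."""
    total = 0.0
    for k, m in enumerate(means):
        members = [x for x, c in zip(observations, assignment) if c == k]
        total += len(members) * sum(squared_distance(x, m) for x in members)
    return total


# Hypothetical data: two well separated groups of three observations.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = k_means(data, initial_means=[data[0], data[3]])
```

The result depends on the initial means, so in practice the procedure is typically repeated from several starting points and the solution with the smallest objective is retained.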

2.2.2. Hierarchical clustering



The result of the application of the K-means clustering
technique (Sec. 2.2.1) to a dataset depends on K, the
number of clusters to be searched. Hierarchical clustering
methods do not require this number to be specified;
instead, they require the user to specify a measure of
dissimilarity between groups of observations based on the 
pairwise dissimilarities among the observations in the two 
groups. Overall, there are two strategies for hierarchical 
clustering: agglomerative and divisive. 

In the agglomerative strategy, the algorithm starts aggregating
at the lowest level possible (each group is composed
of only one observation) and at each level (or generation)
a pair of clusters is recursively merged into a
single cluster. In the divisive strategy, the starting point 
is the top of the "tree" (all observations in one cluster) 
and at each level each cluster is recursively split into two 
new clusters. The merging in the agglomerative method, 
at each level, involves the aggregation of the two groups 
with the smallest intergroup dissimilarity. In the divisive 
methods, instead, at each level the splitting produces two 
new clusters with the largest possible between-group dis- 
similarity. Recursive splitting/agglomeration can be rep- 
resented by a rooted binary tree, where the nodes repre- 
sent groups. The root node is associated with the whole 
dataset and each terminal node represents one of the individual
observations. A common representation of the
hierarchical structure, called a dendrogram (Hartigan 1975),
is obtained by plotting the binary tree so that the height
of each node is proportional to the value of the intergroup
dissimilarity between its "children" nodes. Hierarchical
clustering techniques impose a hierarchical structure on
the data even when such a structure does not exist in the
data. For this reason, in this paper we will not interpret
the clusterings in terms of the properties of the hierar- 
chical structure they belong to, but only in terms of the 
properties of their feature distribution and of the prop- 
erties of the distribution of the labels associated with the 
cluster members. 
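
The agglomerative strategy described above can be sketched as a simple loop. This is a naive illustration, not the implementation used in this work: it assumes, as a placeholder, that the intergroup dissimilarity is the smallest pairwise squared distance between the two groups, and the function names are hypothetical. The recorded heights are the values that a dendrogram would display.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two observations."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def intergroup_dissimilarity(group_a, group_b, points):
    """Placeholder choice: smallest pairwise dissimilarity between the two groups."""
    return min(squared_distance(points[i], points[k])
               for i in group_a for k in group_b)


def agglomerate(points):
    """Return the list of merges (members_a, members_b, height), bottom-up."""
    clusters = [(i,) for i in range(len(points))]  # one observation per group
    merges = []
    while len(clusters) > 1:
        # Find the pair of groups with the smallest intergroup dissimilarity.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = intergroup_dissimilarity(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        height, a, b = best
        merges.append((clusters[a], clusters[b], height))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges


# Hypothetical one-dimensional sample of four observations.
points = [(0.0,), (1.0,), (5.0,), (7.0,)]
tree = agglomerate(points)
```

Cutting the list of merges at a given height yields one clustering, so a single run provides a whole family of partitions with different numbers of clusters.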

Agglomerative hierarchical clustering depends on: a) the
choice of the definition of pairwise dissimilarity (i.e., dissimilarity
between the members of a pair of observations),
and b) the agglomeration or "linking" strategy,
i.e. the definition of the inter-group dissimilarity, usually
based on the pairwise dissimilarity adopted. Several
pairwise dissimilarity definitions have been used for the
method described in this paper; these include the Euclidean
distance (see Eq. (4)), the Manhattan distance
(Eq. (5)), also known as the taxicab distance, and the
maximum (or Chebyshev's) distance (Eq. (6))^3:

$$D(x_i, x_k) = \sqrt{\sum_{j=1}^{N} (x_{ij} - x_{kj})^2} \qquad (4)$$

$$D(x_i, x_k) = \sum_{j=1}^{N} |x_{ij} - x_{kj}| \qquad (5)$$

$$D(x_i, x_k) = \max_{j} |x_{ij} - x_{kj}| \qquad (6)$$

^3 These three distances are special cases of the general
Minkowski's distance defined as:

$$D(x_i, x_k) = \left( \sum_{j=1}^{N} |x_{ij} - x_{kj}|^p \right)^{1/p} \qquad (3)$$

for values of the parameter p equal to 2, 1 and ∞ respectively.
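
The relation between the three distances and the Minkowski distance can be checked numerically. The following sketch assumes observations given as equal-length tuples of numbers; the function names are illustrative.

```python
import math


def minkowski(x_i, x_k, p):
    """Minkowski distance of order p; p = math.inf gives the maximum distance."""
    diffs = [abs(a - b) for a, b in zip(x_i, x_k)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def euclidean(x_i, x_k):
    """Euclidean distance: Minkowski distance with p = 2."""
    return minkowski(x_i, x_k, 2)


def manhattan(x_i, x_k):
    """Manhattan (taxicab) distance: Minkowski distance with p = 1."""
    return minkowski(x_i, x_k, 1)


def chebyshev(x_i, x_k):
    """Maximum (Chebyshev) distance: limit of the Minkowski distance for large p."""
    return minkowski(x_i, x_k, math.inf)


# Hypothetical pair of observations in a three-dimensional feature space.
a, b = (0.0, 0.0, 0.0), (3.0, 4.0, 0.0)
distances = (euclidean(a, b), manhattan(a, b), chebyshev(a, b))
```

For this pair the three values differ, which illustrates that the same observations can be ranked differently depending on the adopted metric.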



All the above metrics are suited for continuous mea- 
surements associated with the observations. The linking 
strategies used in this work are described below. The 
descriptions of the distinct linkage strategies are given in 
the case of agglomerative clusterings, but they are also 
valid for divisive clustering methods: 

a): Single linkage: the inter-cluster dissimilarity between
two generic clusters A and B can be defined
as the minimum pairwise dissimilarity between observations
of each cluster:

$$D(A, B) = \min\{D(x, y) : x \in A, y \in B\} \qquad (7)$$

The clusters are grouped on the basis of the closest
pair of members. For this reason, clusters which,
on average, are not the closest but which share a few
nearby observations can be merged. This is similar
to what happens in clustering methods based on
the "friends-of-friends" algorithm (Hartigan 1975);



b): Complete linkage: the inter-cluster dissimilarity between
two clusters A and B can be defined as the
maximum pairwise dissimilarity between observations
belonging to the two clusters, namely:

$$D(A, B) = \max\{D(x, y) : x \in A, y \in B\} \qquad (8)$$

In this case, clusters are merged when globally very
close to each other, since the condition is on the
farthest pair of observations.

c): Average linkage: the inter-cluster dissimilarity between
the two clusters A and B can be defined as
the average of the pairwise dissimilarities between the
observations of the two clusters:

$$D(A, B) = \langle \{D(x, y) : x \in A, y \in B\} \rangle \qquad (9)$$

This case is intermediate between the single linkage
and complete linkage strategies. Clusters are
merged when they are on average close to each
other, i.e. most of the observations of each cluster are
close to each other. This strategy produces the
most stable configuration because it is less sensitive to
"outliers".

d): Ward's linkage: the inter-cluster dissimilarity can
be defined as a measure of the increase of variance
of the cluster obtained by merging the parent clusters
compared to the sum of the variances of the
two separate clusters:

$$D(A, B) = \mathrm{ESS}(A, B) - [\mathrm{ESS}(A) + \mathrm{ESS}(B)] \qquad (10)$$

where ESS(A, B) is evaluated on the merged cluster and:

$$\mathrm{ESS}(A) = \sum_{i=1}^{N_A} \left| x_i - \frac{1}{N_A} \sum_{j=1}^{N_A} x_j \right|^2 \qquad (11)$$

This linkage strategy provides compact and spherical
clusters which, intrinsically, have minimal internal
variance (Ward 1963).
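
The four linkage strategies listed above can be compared on a toy example. The sketch below assumes the Euclidean distance as pairwise dissimilarity and takes the average linkage as the mean of all the pairwise dissimilarities; the function names and the two small clusters are hypothetical.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def single_linkage(A, B, dist=euclidean):
    """Smallest pairwise dissimilarity between members of A and B."""
    return min(dist(x, y) for x in A for y in B)


def complete_linkage(A, B, dist=euclidean):
    """Largest pairwise dissimilarity between members of A and B."""
    return max(dist(x, y) for x in A for y in B)


def average_linkage(A, B, dist=euclidean):
    """Mean of all pairwise dissimilarities between members of A and B."""
    pairs = [dist(x, y) for x in A for y in B]
    return sum(pairs) / len(pairs)


def ess(cluster):
    """Error sum of squares: squared deviations from the cluster mean."""
    n = len(cluster)
    mean = [sum(x[j] for x in cluster) / n for j in range(len(cluster[0]))]
    return sum(sum((x[j] - mean[j]) ** 2 for j in range(len(mean))) for x in cluster)


def ward_linkage(A, B):
    """Increase in ESS caused by merging A and B."""
    return ess(list(A) + list(B)) - (ess(A) + ess(B))


# Hypothetical clusters: two vertical pairs of points separated by 4 units.
A = [(0.0, 0.0), (0.0, 2.0)]
B = [(4.0, 0.0), (4.0, 2.0)]
```

By construction the average linkage value always lies between the single and complete linkage values, while Ward's linkage is expressed in units of squared distance and is therefore not directly comparable to the other three.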



2.2.3. Self Organizing Maps 

Self Organizing Maps (SOM) (Kohonen 1990; Vesanto
& Alhoniemi 2000) are a constrained version of the K-means
clustering. In SOM, the "prototypes", template
observations determined on the basis of the initialization
of the algorithm, are encouraged to lie on a two-dimensional
surface. This manifold is called a constrained
topological map, since all the observations in the
original feature space are mapped to a two-dimensional
coordinate system. A two-dimensional grid of prototypes
is "bent" by the SOM algorithm to adapt to the observations
as accurately as possible. Once the optimal
mapping is reached, the observations can be mapped down onto
the "prototypes", and each observation is assigned to
the cluster represented by the closest prototype. Given
the K prototypes and the i-th observation $x_i$ in the p-dimensional
feature space, the closest prototype $m_j$ is
picked using the Euclidean distance (Eq. (4)). In the simplest version
of the SOM, the position of each prototype $m_k$ in the neighborhood
of $m_j$ on the grid (including $m_j$ itself) is updated
according to the rule:



$$m_k \leftarrow m_k + \alpha (x_i - m_k) \qquad (12)$$



where α is a number called the learning rate that changes at
each iteration and, usually, decreases from ~1 to 0 over a few
thousand iterations. The positions of the prototypes are
updated until the distance of each observation from the associated
prototype becomes smaller than a given "distance
threshold" r. The distance threshold r decreases linearly
with each new observation considered, according to the
empirical rule followed to update the value of r. As in
the case of the K-means clustering, the number of prototypes,
i.e. the final number of clusters, must be specified
by the user. SOM becomes an online^4 version of the
K-means algorithm for small r distances, yielding only
one observation associated with each prototype. The
SOM algorithm can also be used as a supervised classification
and regression method, using the stable prototype
definitions obtained using a training set, so that "new"
sources will be associated with the closest prototype in
the feature space. Because of their versatility, SOM have
recently been applied to astronomical data to address
distinct problems: the selection and classification of extragalactic
sources from large survey data using their
photometric attributes (Geach 2012), the evaluation of
photometric redshifts (Geach 2012; Way & Klose 2012),
the spectral classification of stars (Bazarghan 2012) and the
reconstruction of the large scale structure of the galaxy distribution
(Way et al. 2011).
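
The update rule of Eq. (12) can be illustrated with a short sketch. It assumes, following the usual textbook description of the SOM rather than any detail given in this paper, that the prototype closest to the observation and its neighbors within a threshold r on the two-dimensional grid are moved, and that both the learning rate and r shrink linearly during training; all names and the toy grid are hypothetical.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def closest_prototype(x, prototypes):
    """Grid coordinate of the prototype nearest to x in feature space."""
    return min(prototypes, key=lambda g: euclidean(x, prototypes[g]))


def som_update(x, prototypes, alpha, r):
    """Move the winner and its grid neighbours within r toward x."""
    winner = closest_prototype(x, prototypes)
    for g, m in prototypes.items():
        if euclidean(g, winner) <= r:  # neighbourhood measured on the 2-D grid
            prototypes[g] = tuple(mk + alpha * (xi - mk) for mk, xi in zip(m, x))
    return winner


def train(observations, prototypes, n_iter=1000, r0=1.0):
    """Cycle over the observations while alpha and r shrink linearly to zero."""
    for t in range(n_iter):
        frac = 1.0 - t / n_iter
        x = observations[t % len(observations)]
        som_update(x, prototypes, alpha=frac, r=r0 * frac)
    return prototypes


# Hypothetical 1x2 grid of prototypes in a two-dimensional feature space.
protos = {(0, 0): (0.0, 0.0), (0, 1): (1.0, 1.0)}
winner = som_update((0.0, 0.2), protos, alpha=0.5, r=0.0)
```

With r = 0 only the winning prototype is moved, which corresponds to the limit in which the SOM behaves like an online K-means update.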



2.3. The Score 

As discussed in Sec. 2, the originality of the CLaSPS
method lies in the criterion used to select the most meaningful
aggregations of sources in the feature space, which
is based on the correlation with the labels, i.e. other observables
not used to produce the clustering. This correlation
is evaluated using a novel indicator, called the
score.

^4 An online algorithm is one that can process its input in a piece-by-piece
fashion, so that the whole input is not available from the
start.

Each one of the methods described in the previous 
paragraphs provides us with one or more clusterings 
when applied to a given dataset. Each observation is 
uniquely associated with one of the clusters in the clus- 
tering (i.e., each observation belongs to one and only 
one cluster for each clustering). Additional information 
available for each source in the dataset, but not used for 
the clustering (i.e. not used to build the feature space) 
can be used to label the content of the clusters of each 
clustering. Categorial labels provide a natural binning; 
continuous labels, in our method, must be binned for the 
evaluation of the scores. The binning can be either a set 
of continuous intervals (for continuous labels) or a set 
of (single or grouped) values (for categorial labels). The 
distribution of label values for the members of each cluster
is used to determine the level of correlation between
the label and the single cluster. The degrees of correla- 
tion between the label distribution and each cluster of a 
clustering are then combined to provide a measure of the 
degree of correlation of the label distribution with the 
clustering as a whole. 

The score provides a quantitative measure of the correlation
between a label and the cluster membership for a
given clustering. For any label L in the set of $N_L$ labels
available, a binning of L is represented by a set of $M^{(L)}$
classes, either quantitative intervals or categorial values
$\{C_1^{(L)}, C_2^{(L)}, \ldots, C_{M^{(L)}}^{(L)}\}$. The basic element of the score
definition is the fraction of the i-th cluster members
with values of the label falling in the j-th class:



$$f_{ij} = \frac{n_{ij}\left(L_i \in C_j^{(L)}\right)}{N_i} \qquad (13)$$



where $L_i$ is the set of values of the label L associated
with the members of the i-th cluster, $n_{ij}$ is the number
of members of the i-th cluster with label values
belonging to the j-th label class, and $N_i$ is the number
of members of the i-th cluster of the clustering.
$F_i = \{F_{i1}, F_{i2}, \ldots, F_{iM^{(L)}}\}$ can be defined as the arrangement
of the fractions $f_{ij}$ sorted in increasing order^5. The
score of the i-th cluster of a clustering composed of $N_{clust}$
clusters relative to the label L can be defined as follows:



$$S_i = \sum_{j=2}^{M^{(L)}} \left\| F_{ij} - F_{i(j-1)} \right\| \qquad (14)$$



Using the definition of score for a single cluster, the total
score of the clustering relative to the label L can be
defined as:

$$S_{tot} = \frac{1}{N_{clust}} \sum_{i=1}^{N_{clust}} S_i \qquad (15)$$

where $S_i$ is the score evaluated for the i-th cluster of the
clustering defined in Eq. (14). By definition, the total score
$S_{tot}$ and each single cluster score $S_i$ are normalized to
unity.

^5 A distinct arrangement of the fractions $f_{ij}$ is determined for
each cluster.
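
The chain from label fractions to the total score can be sketched as follows. This is an illustration under an explicit assumption: the score of a single cluster is taken to be the sum of the absolute differences between consecutive fractions sorted in increasing order, which is normalized to unity by construction, while the total score is the unweighted mean of the cluster scores. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def label_fractions(cluster_labels, classes):
    """Fractions f_ij of one cluster's members falling in each label class."""
    counts = Counter(cluster_labels)
    n_i = len(cluster_labels)
    return [counts[c] / n_i for c in classes]


def cluster_score(fractions):
    """Assumed form of S_i: sum of differences between consecutive sorted fractions."""
    ordered = sorted(fractions)
    return sum(abs(ordered[j] - ordered[j - 1]) for j in range(1, len(ordered)))


def total_score(clusters, classes):
    """S_tot: unweighted mean of the cluster scores."""
    scores = [cluster_score(label_fractions(c, classes)) for c in clusters]
    return sum(scores) / len(scores)


# Hypothetical categorial label with two classes.
classes = ["BL Lac", "FSRQ"]
pure = ["BL Lac"] * 4                          # all members in one class
mixed = ["BL Lac", "FSRQ", "BL Lac", "FSRQ"]   # members evenly split
```

Under this assumption a cluster whose members all share one label class scores 1, while a cluster whose members are evenly spread over the classes scores 0.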

The weighted total score $S'_{tot}$ can be defined as the
total score $S_{tot}$ where the score of each cluster
is weighted according to the number of sources of the
cluster:

$$S'_{tot} = \frac{1}{N_{clust}} \frac{\sum_{i=1}^{N_{clust}} N_i \times S_i}{N_{tot}} = \frac{1}{N_{clust} N_{tot}} \sum_{i=1}^{N_{clust}} N_i \times S_i \qquad (16)$$

where $N_{tot}$ is the total number of sources in the clustering.

The contributions of all clusters to the total score $S_{tot}$
are equally weighted. For this reason, $S_{tot}$ is sensitive
to small clusters with few members with a large degree
of correlation with the label (for example singletons, i.e.
clusters composed of only one observation). Instead, the
contribution of each cluster to the weighted total score
$S'_{tot}$ is proportional to the ratio of its members to the
total number of observations in the clustering. By definition,
$S'_{tot}$ is a declining function of the total number
of clusters of the clustering. The weighted total score
$S'_{tot}$ is heavily influenced by the largest clusters and, as
a consequence, is a measure of the "mass-weighted" degree
of correlation of the dataset. Both $S_{tot}$ and $S'_{tot}$ are
used to select the optimal clusterings because they represent
complementary measures of distinct aspects of the
clusterings, namely the level of correlation of the largest
clusters and the existence of less populated groups of
sources.
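
The complementary behavior of the two scores can be illustrated with a small numerical sketch. It assumes that the total score is the mean of the cluster scores and that the weighted total score is the sum of the cluster scores weighted by the cluster sizes, divided by both the total number of sources and the number of clusters; the cluster scores and sizes below are invented for illustration.

```python
def total_score(scores):
    """Total score: every cluster contributes equally."""
    return sum(scores) / len(scores)


def weighted_total_score(scores, sizes):
    """Weighted total score: cluster scores weighted by their share of the sources."""
    n_tot = sum(sizes)
    weighted = sum(n * s for n, s in zip(sizes, scores))
    return weighted / (len(scores) * n_tot)


# Hypothetical clustering A: one large uncorrelated cluster plus two singletons.
scores_a, sizes_a = [0.0, 1.0, 1.0], [98, 1, 1]
# Hypothetical clustering B: two large, moderately correlated clusters.
scores_b, sizes_b = [0.5, 0.5], [50, 50]

summary = {
    "A": (total_score(scores_a), weighted_total_score(scores_a, sizes_a)),
    "B": (total_score(scores_b), weighted_total_score(scores_b, sizes_b)),
}
```

In this toy case clustering A has the larger total score because of its singletons, while clustering B has the larger weighted total score because the correlation resides in its most populated clusters, so ranking the clusterings by both quantities highlights different structures.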

2.3.1. Score assessment 

Before applying the scores to the real astronomical
datasets described in Sec. 3 and Sec. 4, we have tested
the effectiveness of this method with simulated clusterings.
In these simulations, we assume that the final
structure of a generic clustering is independent of the
unsupervised clustering algorithm used to produce the
clustering. This assumption is reasonable because the
algorithm depends on the topological relations among
the sources of the dataset in the feature space where the
clustering has been generated and on the linking strategy
used to associate the sources (see Sec. 2.2 for more
details). Both the properties of the clustering algorithm
and the actual values of the features and labels associated
with the simulated sources are of no importance in
an idealized description of the clustering, where topological
and relational properties of the observations are
condensed in the membership, a categorial quantity,
for each source. On these premises, the fundamental parameters
describing simulated clusterings are: the total
number $N_{tot}^{(sim)}$ of observations of the sample of the simulated
clustering; the number of clusters in the clustering
$N_{clust}^{(sim)}$; the number of members of the i-th cluster, normalized
to the total number of observations in the sample,
where $i \in \{1, \ldots, N_{clust}^{(sim)}\}$, associated with the
spread of the sizes of the clusters measured with the
variance σ; the number of classes of the label $M^{(sim)}$;
and a prescription to assign the label values to
the cluster members.



Table 2. Ranges of the parameters of the simulated clusterings used to
validate the effectiveness of the score definitions in capturing the
degree of correlation between label value classes and cluster
membership distributions.

Parameter               Value(s)
[...]                   1000
[...]                   [30, 200]
N_clust^(sim)           [3, 12]
[...]                   [2, 10]

Type of clustering      % not-random / random
Weakly/not corr.        [0%, 30%] / [100%, 70%]
Moderately corr.        [30%, 70%] / [70%, 30%]
Strongly corr.          [70%, 100%] / [30%, 0%]


Three different scenarios have been considered in order 
to create realistic simulated clusterings. These scenarios 
have inspired distinct association rules between classes 
of label values and observations in the clusters that have 
been used to generate the simulated clusterings. These 
scenarios are: a) label values belonging to any label class 
are randomly associated with the observations, regard- 
less of their membership; b) label values of each label 
class are assigned only to sources belonging to a fixed 
small number of randomly selected clusters; and c) label 
values of each label class are assigned to observations in 
only one randomly selected cluster in the clustering. 

We produced simulated clusterings with different de- 
grees of correlation between label values and clusters 
memberships by mixing the previous three prescriptions 
in different percentages. Thus, we obtained "recipes" 
to simulate weak, moderate and strong correlations in 
clusterings. For the fraction of class label values not ran- 
domly associated, each value was assigned only to sources 
belonging to a randomly picked number of clusters, with 
one, two or three clusters being the most likely options 
by definition. The remaining fraction of label values was 
randomly assigned to sources of the clusterings indepen- 
dently from their cluster memberships. 
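
The mixing of random and non-random label assignments can be sketched as follows. This is a deliberately simplified illustration, not the procedure used for the simulations of this paper: it assumes that each cluster is tied to a single randomly chosen label class, whereas the text allows each class to be spread over a small number of clusters. The function name `simulate_labels` and its parameters are hypothetical.

```python
import random


def simulate_labels(memberships, n_classes, frac_not_random, seed=0):
    """Assign a label class to every source of a simulated clustering.

    memberships:      cluster index of each source.
    n_classes:        number of label classes.
    frac_not_random:  probability that a source receives the class tied
                      to its cluster instead of a random class.
    """
    rng = random.Random(seed)
    # Simplifying assumption: one "home" class per cluster.
    home_class = {c: rng.randrange(n_classes) for c in sorted(set(memberships))}
    labels = []
    for cluster in memberships:
        if rng.random() < frac_not_random:
            labels.append(home_class[cluster])      # correlated with membership
        else:
            labels.append(rng.randrange(n_classes))  # independent of membership
    return labels


# 200 sources spread evenly over 4 clusters, 3 label classes.
memberships = [i % 4 for i in range(200)]
strong = simulate_labels(memberships, n_classes=3, frac_not_random=1.0)
weak = simulate_labels(memberships, n_classes=3, frac_not_random=0.0)
```

With `frac_not_random=1.0` every cluster contains a single label class (total correlation), while `frac_not_random=0.0` reproduces the fully random case; intermediate values give the partially correlated regimes.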

The weakly correlated clusterings have a 0%-30% not 
randomly assigned class, the moderately correlated clus- 
terings have 30% to 70% and the extremely correlated 
clusterings have 70% to 100%. These intervals have been 
selected in order to verify the effectiveness of the score to 
express the level of correlation among label distribution 
and cluster memberships in realistic scenarios where the 
classes are partially correlated with a subset of clusters, 
and in the extreme cases with total correlation (100% of 
not randomly assigned classes) and no correlation (0% 
of not randomly assigned classes), as a function of the 
parameters of the simulated clusterings. We produced 
equal numbers of simulated clusterings for each of the 
three prescriptions described above. All the parameters
of the simulations were free to vary in the intervals described
in Table 2, where the composition of the
three classes of clusterings is also summarized.



Both normal scores (Eq. (15)) and weighted scores
(Eq. (16)) were evaluated in each family of simulated
clusterings. The histograms of the distributions of values
of $S_{tot}$ and $S'_{tot}$ for the simulated clusterings are shown
in Figure 1. The scatterplots of the values of the scores
as functions of the parameters of the simulations
$N_{clust}^{(sim)}$, $M^{(sim)}$, $N_{tot}^{(sim)}$ and $\sigma^2(N_i^{(sim)})$ are shown in Figure 2.

From the differential and cumulative histograms shown
in the plots in Figure 1, it is evident that the normal score
$S_{tot}$ (filled bars) spans the whole range [0, 1], with the
values for the extremely correlated simulated clusterings
ranging from 0.6 to 1, the moderately correlated clusterings
having score values between ~0.4 and 0.7, and the weakly
correlated clusterings having total normal scores
smaller than 0.5. In the same plots, the values of the
total weighted scores $S'_{tot}$ (dashed bars) are consistently
smaller than $S_{tot}$ and are not normalized to unity (as
remarked in Sec. 2.3). The weighted total scores $S'_{tot}$ for
the three families of simulated clusterings are separated
less clearly than in the case of the normal total scores,
as the weights depend on the sizes of the clusters $N_i^{(sim)}$
and, consequently, the total value of the score depends
on the variance $\sigma^2(N_i^{(sim)})$ of the sizes of the clusters.
Even so, the strongly correlated clusterings are associated
with values of $S'_{tot}$ on average larger than the $S'_{tot}$
values for partially correlated simulated clusterings and
randomly drawn clusterings. The four plots in Figure 2
show that the values of the scores do not show significant
dependencies on any of the four parameters describing
the simulated clusterings: $N_{clust}^{(sim)}$, $M^{(sim)}$, $N_{tot}^{(sim)}$ and
$\sigma^2(N_i^{(sim)})$. The results of the simulations demonstrate
that the scores, as defined in this paper, are unbiased
diagnostics of the degree of correlation between the
distribution of observations in a clustering and the label class
values.




2.4. Choice of Clustering

Given a label L and a set of label classes $M_L$, CLaSPS
produces distinct values of the total scores $S_{tot}$ and $S'_{tot}$
for each clustering produced by any clustering method
employed (see Sec. 2.2). The clusterings produced by
a single method differ in the total number of clusters
$N_{clust}$. The scores can be plotted as a function of the
number of clusters and the clustering method, as shown
in the left side of Figure 3, to identify the clusterings
with the largest degree of correlation between the label
classes and the clustering members. A similar plot can
be used to immediately determine the clustering with
the largest correlation between the features distribution
of the clusters and each of the whole set of labels at once
(right side plot in Figure 3).

Since the total scores are averaged over all the clusters
in a given clustering, they can only provide information
on the global degree of correlation of the distribution of
label classes and the clustering membership. Information
about the local correlations is carried by the value of
the scores for each cluster contained in the clustering
separately.

For a given label L and a set of clusterings produced by
the same clustering algorithm but with different numbers
of clusters, the values of the scores and the number of
members of each cluster are shown in the "heatmap" plot
on the right side of Figure 4. This specific type of plot is
useful to select large cluster score values that may not be
reflected in the global scores, which are averaged over all
clusters of the clustering (see equations (15) and (16)).

The left side plot in Figure 4 shows the values of the total
normal and weighted scores for distinct clusterings produced
by a given clustering method as a function of the
total number of clusters $N_{clust}$, for a whole set of labels.
These plots can be used to determine whether multiple
labels show similar trends in their degrees of correlation
with the distribution of members of the clusters of each
clustering. In this way, correlated attributes can be selected
on the basis of the result of the clustering and
labeling procedure and their dependences can be taken
into account during the interpretation of the results.
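
The selection of the clustering with the largest global score, which the text performs by inspecting plots, can also be sketched programmatically. The score values below are invented for illustration only and do not come from any dataset discussed in this paper; the function name `best_clustering` is likewise hypothetical.

```python
# (clustering method, N_clust) -> (S_tot, S'_tot); made-up values.
scores = {
    ("K-means", 3): (0.41, 0.12),
    ("K-means", 4): (0.63, 0.15),
    ("K-means", 5): (0.58, 0.10),
    ("SOM", 4): (0.60, 0.14),
    ("SOM", 5): (0.55, 0.09),
}


def best_clustering(scores, weighted=False):
    """Return the (method, N_clust) key with the largest total score."""
    index = 1 if weighted else 0
    return max(scores, key=lambda key: scores[key][index])


best_normal = best_clustering(scores)
best_weighted = best_clustering(scores, weighted=True)
```

A ranking of this kind only captures the global correlation; as noted above, clusterings with smaller total scores may still contain individual clusters with large scores, which is why the per-cluster "heatmap" view is inspected as well.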

2.5. Uncertainties 

The uncertainty on the features of a dataset can affect
the result of the clustering and the selection of correlations
among the cluster distributions and the labels. The
clustering methods discussed in Sec. 2.2 do not take into
account the presence of uncertainties on the attributes.
The effect of the errors on the features can be evaluated 
by applying CLaSPS to multiple realizations of the same 
dataset obtained by "perturbing" the features distribu- 
tion and evaluating the spread of the scores distribution 
relative to the different clusterings. 

In the case of the experiments described in this paper,
multiple realizations of the dataset feature distributions
have been obtained by assuming that the error
$\sigma_{x_i}$ on the i-th feature can be interpreted as the standard
deviation of a normal error distribution. While this
is a reasonable assumption for the uncertainties on the
photometric quantities from large area surveys like the
SDSS (e.g., Fukugita et al. 1996), from which the features
of the datasets discussed in this paper have been
extracted, this method is general. For example, if the
uncertainties on distinct features have to be modeled with
distinct distributions, the perturbations can be independently
extracted for every single feature, according to
the same procedure described in the following for Gaussian
distributions. For the experiments discussed in this
paper we randomly extracted a distinct perturbing number
$p_i$ from a Gaussian distribution centered around zero
and with width equal to twice the uncertainty $\sigma_{x_i}$ on
the value of the attribute $x_i$, for any given source of the
dataset. The new realization of the i-th feature $x_i$ is
defined as follows:

$$x_i \rightarrow x_i + p_i \qquad (17)$$

where $p_i$ can be positive or negative: $p_i \in [-\sigma_{x_i}, \sigma_{x_i}]$.
This approach, in general, can be time-consuming as it 
requires CLaSPS to be run multiple times on slightly 
different realizations of the same multi-dimensional fea- 
tures distribution. However, the high dimensionality of 
the feature space where clustering methods are applied 
usually guarantees that the results of the clustering are 
robust. This statement can be verified by observing the 
distribution of total and cluster scores values for several 
distinct realizations of the dataset obtained by perturbing
the values of the features as described above (see
Figure 5 for an example of the distributions of the
cluster and total scores for 50
realizations obtained with the above procedure).
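
The perturbation procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the quoted error of each feature is used directly as the standard deviation of the Gaussian from which the perturbation is drawn, and `score_fn` is a hypothetical stand-in for a full clustering-and-scoring run, which is not reproduced here.

```python
import random
import statistics


def perturb(features, errors, rng):
    """Return one new realization x_i + p_i of every feature of every source."""
    return [
        [x + rng.gauss(0.0, err) for x, err in zip(row, err_row)]
        for row, err_row in zip(features, errors)
    ]


def score_spread(features, errors, score_fn, n_real=50, seed=0):
    """Mean score and largest relative deviation over n_real realizations."""
    rng = random.Random(seed)
    scores = [score_fn(perturb(features, errors, rng)) for _ in range(n_real)]
    mean = statistics.fmean(scores)
    max_rel_dev = max(abs(s - mean) for s in scores) / abs(mean)
    return mean, max_rel_dev


def toy_score(features):
    """Toy stand-in for a score: mean of the first feature."""
    return statistics.fmean([row[0] for row in features])


features = [[1.0, 2.0], [3.0, 4.0]]    # two sources, two features
errors = [[0.05, 0.05], [0.05, 0.05]]  # assumed 1-sigma errors
mean_score, max_rel_dev = score_spread(features, errors, toy_score)
```

The returned relative deviation is the quantity that would be compared with a chosen stability threshold for the scores.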

In the case of the datasets discussed in this paper and
described in Sec. 3 and 4, arbitrary thresholds of 5% and
10% variation over the mean value of each total and cluster
score respectively have been set to evaluate the stability
of the clusterings. The results obtained confirmed
that all clusterings are insensitive to the "perturbations"
to the values of the features within these values of the
threshold. This result was expected, since the datasets
considered in this paper are sparse in the feature space
where the clustering methods are applied, leading to
intrinsically stable clustering configurations.

Figure 1. Left: the normalized histograms of the values of weighted and unweighted scores for the three classes of simulated clusterings
(gray: weakly correlated clusterings, blue: moderately correlated clusterings, red: extremely correlated clusterings) are shown as solid and
open bars respectively. Right: cumulative distributions of the scores for the three classes of clusters are shown.

3. APPLICATION TO THE CSC+ DATASET

The CSC+ is an example of the class of inhomogeneous
datasets that have become common thanks to the
emerging VO technology. As discussed in Sec. 1, KD
methods can improve the extraction of useful correlations
from such samples, by minimizing the influence of
biases and selection effects inherent to federated data.
In this specific case, we will show that the application of
the CLaSPS method leads to the determination of simple
well known relations between observables from distinct
spectral regions.

3.1. The CSC+ dataset 

CSC+ is a sample of spectroscopically selected optical
quasars with X-ray observations in the Chandra Source
Catalog (Evans et al. 2010) (CSC) for which additional
multi-wavelength photometric data are available. These
sources have been classified as quasars using the Sloan
Digital Sky Survey (Aihara et al. 2011) (SDSS) spectroscopic
observations. In addition, the sources of the
CSC+ sample have been selected so that both near-infrared
and ultraviolet photometric observations can
be retrieved from the UKIRT Infrared Deep Sky Survey
(Lawrence et al. 2007) (UKIDSS) and the Galaxy
Evolution Explorer (Martin et al. 2005) (GALEX) catalogs
respectively.

The GALEX and UKIDSS counterparts to the sources
and upper limits in both the CSC and SDSS surveys
have been determined using pre-selected crossmatched
catalogs containing all sources detected in the SDSS and
in each of the two datasets discussed. More specifically,
we have used the SDSS-GALEX crossmatched sample of
sources (Budavari et al. 2009) to determine the UV counterparts
of the SDSS-Chandra sources, and the crossmatched
table of the UKIDSS counterparts of the SDSS
stellar sources for the IR photometry, available through
the web interface to the GALEX database.

The total number of sources of the CSC+ sample is
112 when considering only detections in the CSC (dataset
used in experiments 1 and 2 of Table 3) and 192 including
all sources with reliable Chandra upper limits for the
flux in the broad Chandra energy band as returned by
the CSC sensitivity map service (dataset used in experiment 3).
The final CSC+ sample is composed of radio-quiet
quasars, except for two sources that can be found
in the VLA FIRST Survey Catalog (Becker et al. 1995).
More details on the specific data used to build the set of
features and labels of the CSC+ sample described above
and listed in Table 3 are discussed below.

CSC, the Chandra Source Catalog (Evans et al. 2010),
includes $\sim 1.06 \times 10^5$ unique unresolved or slightly extended
X-ray sources with 5-band photometry. The total
cumulative sky coverage is 320 deg$^2$, but since the majority
of sources have broad band fluxes of $\sim 10^{-14}$ cgs, the
effective coverage is ~260 deg$^2$. The sensitivity varies in
different regions. A catalog containing CSC-SDSS positionally
cross-matched sources (Evans et al. 2010) covers
~133 deg$^2$, including $\sim 1.7 \times 10^4$ Chandra sources,
of which ~9000 have stellar and ~7800 extended optical
counterparts, mostly galaxies.

SDSS DR8 (Aihara et al. 2011) has observed $\sim 1.4 \times 10^4$
deg$^2$ of the sky in 5 bands ugriz, with photometric limiting
magnitude of 22.2 in the r band (95% completeness
for point sources). It includes spectra of $\sim 1.8 \times 10^6$
sources in the 380-920 nm wavelength range. Classification
in quasar, high-redshift quasar, galaxy, star and
late-type star classes and spectroscopic redshifts are
available for these spectroscopically observed sources,
based on the measured lines of the optical spectra. If
emission lines are observed and if the source has a final
redshift larger than 2.3, it is classified as a high-redshift
quasar.

Figure 2. Plots of the score values for simulated clusterings as functions of the general parameters of the simulated clusterings. In all
plots, the gray, blue and red symbols are associated with the scores evaluated for the three families of clusterings with weakly, moderately
and extremely correlated association of the values of label classes to the members of the clusters. Top left: the values of the scores are
plotted as functions of the number of clusters $N_{clust}^{(sim)}$ of the simulated clusterings; top right: score values as function of the number of
classes of the labels $M^{(sim)}$. Bottom left: score values plotted as function of the total number of observations $N_{tot}^{(sim)}$ of the simulated
clusterings; bottom right: the score values as function of the variance of the sizes of the clusters belonging to the simulated clusterings.
In all plots, weighted and normal scores are represented by open and filled symbols respectively, while the continuous lines show the
average and the moving window medians of the scores distributions for each class of simulated clusterings, in the upper and lower plots
respectively.

UKIDSS (Lawrence et al. 2007) has been designed
to be the infrared counterpart to the SDSS and covers
~7500 deg$^2$ of the sky in JHK near-infrared bands to
K = 18.3. The Large Area Survey (LAS) has imaged
~4000 deg$^2$ (overlapping with the SDSS), with the additional
Y band to a limiting magnitude of 20.5. The final
area of the overlap between CSC and UKIDSS LAS will
be ~50 deg$^2$.

GALEX (Martin et al. 2005) is a 2-band survey (far
and near UV) that has observed the whole sky up to a
limiting magnitude nuv = 20.5. It includes deep fields to
magnitude 25 with spectroscopic observations.

3.2. CSC+ dataset: features and labels 

Figure 3. Plots of the normal and weighted total score distributions for a generic dataset for distinct total number of clusters $N_{clust}$ of
the clustering and type of unsupervised clustering algorithm used to produce the clusterings. Left: the total scores distributions, evaluated
for a specific label L, are plotted as a function of the total number of clusters of the clusterings for all unsupervised clustering algorithms;
right: both total score distributions are plotted as a function of the total number of clusters $N_{clust}$ of the clusterings for multiple labels. In
both plots, the total normal scores $S_{tot}$ and weighted scores $S'_{tot}$ are plotted separately for the sake of clarity.

Figure 4. Example of "heatmap" plots for a generic dataset and set of clusterings extracted using a given clustering method for one set
of labels and one particular label class distribution. Left: each cell represents a whole distinct clustering obtained using a given clustering
algorithm, and the values $S_{tot}$ and $S'_{tot}$ (in square brackets) of the total normal and weighted scores respectively are reported for a set of
labels. Right: each cell (except for the cells in the upper row) represents a cluster and contains the cluster score value and the number
of members of the cluster (in parentheses). The upper row shows the total normal and weighted (in square brackets) score values of each
clustering.

Figure 5. Total scores $S_{tot}$ and $S'_{tot}$ distributions for 50 realizations
of the features of a generic dataset, as described in Sec. 2.5
(solid points). In the upper panel, the cluster and total normal
scores distributions for 4 distinct clusters and their clustering are
plotted in full and open points respectively. In the lower panel,
the weighted total scores for the same clusterings are plotted. The
±5% and ±10% intervals around the mean values of the total and
cluster scores are drawn for reference in both panels. The variations
of the cluster normal scores can reach 10% of their unperturbed
values (open circles in the upper panel), but the variations
in the total normal and weighted (solid circles in upper and lower
panels) scores barely reach 5% of their values. This fact shows the
robustness of the total score values with respect to the presence of
uncertainties on the features.

Three different experiments have been performed on
the CSC+ sample using distinct combinations of features
for the clustering and labels for the evaluation of
the scores. The features have all been extracted from the
overall set of colors obtained from consecutive photometric
filters, while the labels are either photometric measurements
(not used for the clustering, in these cases) or
the spectroscopic redshifts, classification flags and other
parameters related to the shape of the spectral energy
distributions of the sources. For example, the $\alpha_{OX}$ parameter
(Avni & Tananbaum 1982) has been used as a
label. The $\alpha_{OX}$ parameter measures the relative amount
of energy emitted in the optical and X-rays and is defined
as the spectral slope between the optical/UV and X-ray
monochromatic luminosities at $\lambda = 2500$ Å and $E = 2$ keV
respectively:

$$\alpha_{OX} = -\log(L_{opt}/L_X) \, [\log(\nu_{opt}) - \log(\nu_X)]^{-1} \qquad (18)$$

The features of the experiments performed using the
CSC+ sample are described in detail in Table 3.
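
As an illustration, the spectral slope between the two monochromatic luminosities can be computed with the short Python sketch below. It assumes the common convention in which the index is positive when the optical monochromatic luminosity exceeds the X-ray one; the function name `alpha_ox` and the input luminosities are hypothetical.

```python
import math

H_KEV_S = 4.135667696e-18      # Planck constant in keV s
C_ANGSTROM_S = 2.99792458e18   # speed of light in Angstrom per second


def alpha_ox(l_opt, l_x, e_x_kev=2.0, lambda_opt_angstrom=2500.0):
    """Slope between monochromatic luminosities at 2500 A and 2 keV.

    l_opt, l_x: monochromatic luminosities in the same units
                (for example erg s^-1 Hz^-1).
    """
    nu_x = e_x_kev / H_KEV_S                     # about 4.8e17 Hz
    nu_opt = C_ANGSTROM_S / lambda_opt_angstrom  # about 1.2e15 Hz
    return -(math.log10(l_opt) - math.log10(l_x)) / (
        math.log10(nu_opt) - math.log10(nu_x)
    )


# Illustrative luminosities differing by four orders of magnitude.
example = alpha_ox(1e30, 1e26)
```

With these frequencies the expression reduces to about 0.384 times the logarithm of the luminosity ratio, so the example evaluates to roughly 1.54.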

3.3. Results of the application of CLaSPS to CSC+ 

The main results of the application of CLaSPS to the
two different datasets based on the CSC+ sample are
summarized in the plots in Figures 6 and 7.

In the first experiment (see Table 3), we have found
a significant correlation between the near-infrared, optical
and ultraviolet colors, used as features, and the $\alpha_{OX}$
index for two clusterings, composed of 4 and 5 clusters
respectively and produced using the K-means and SOM
methods (see leftmost plot in the upper row of Figure 6).
Even if the total normal and weighted score values for
the clustering composed of 5 total clusters are smaller
than the scores for the other two clusterings, this clustering
has been considered more interesting because of the
larger number of sources contained in clusters with significantly
large values of the cluster scores. We will discuss
here the correlation involving the members of the second,
third and fourth clusters of the clustering composed of
five total clusters (see upper-left plot in Figure 6).

In order to determine whether there is a subset of features
responsible for the correlation observed, a Principal
Component Analysis (Hartigan 1975) (PCA) has been
performed on the feature distribution of these three clusters.
The PCA finds that the correlation is mostly due
to the blue optical/UV colors nuv - u and u - g.
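
The way a PCA singles out the features that dominate the variance of a set of cluster members can be illustrated with a small, dependency-free Python sketch. It uses power iteration on the covariance matrix to obtain only the first principal component; this is an illustrative stand-in, not the PCA implementation used for this paper, and the toy color values below are invented.

```python
def first_principal_component(rows, n_iter=200):
    """Return (loadings, explained variance fraction) of the first PC."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the features.
    cov = [
        [sum(c[j] * c[k] for c in centered) / (n - 1) for k in range(d)]
        for j in range(d)
    ]
    v = [1.0] * d
    for _ in range(n_iter):  # power iteration towards the top eigenvector
        w = [sum(cov[j][k] * v[k] for k in range(d)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[j] * sum(cov[j][k] * v[k] for k in range(d)) for j in range(d)
    )
    total_variance = sum(cov[j][j] for j in range(d))
    return v, eigenvalue / total_variance


# Toy members of a cluster: columns are nuv-u, u-g and g-r (invented values).
colors = [
    [0.1, 0.5, 0.30],
    [0.4, 0.9, 0.31],
    [0.8, 1.2, 0.29],
    [1.1, 1.6, 0.30],
    [1.5, 2.1, 0.31],
]
loadings, explained = first_principal_component(colors)
```

In this toy example the first component carries almost all of the variance and its loadings are concentrated on the first two columns, which is the kind of outcome that identifies a subset of features as responsible for a correlation.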

The correlation between the optical blue and near-UV
features of quasars and the $\alpha_{OX}$ spectral index has been
discussed in several papers in the literature (see, for example,
Vignali et al. 2003; Lusso et al. 2010), and can
be explained on the basis of the definition of the $\alpha_{OX}$
spectral index itself (given in Sec. 3.2), and the characteristics
of the spectral energy distributions (SED) of
homogeneous samples of radio-quiet quasars observed in
the X-rays. In particular, the presence of this correlation
is usually associated with the presence of a prominent
component of the SED of the quasars at near-UV
wavelengths, called the "big blue bump".

In Figure 7 we show the $L_{opt}$(2500 Å) vs $\alpha_{OX}$ distribution
of the sample used in the first experiment for the
clustering composed of five clusters. The projections of
the five clusters are plotted as shaded colored regions and
closed black lines for the three clusters with large score
values (clusters one, three and five in the upper-right plot
in Figure 6) and the remaining two clusters (clusters two
and four in the upper-right plot in Figure 6) respectively.

The clusters with large scores are those for which the
correlation between the optical monochromatic luminosity
and the $\alpha_{OX}$ index is more significant. In fact, while
the significance of finding a correlation between the two
parameters for the whole sample is low (43%), the significance
is larger for the subset of points of the three
clusters selected (90%). In Figure 7 the sizes of the
symbols are proportional to the nuv - u color values of the
sources, and it is evident that the members of the correlated
clusters have, on average, lower values of nuv - u.
This suggests that the SEDs of the sources belonging
to the three clusters with large score values are dominated
by the "big blue bump" component (the red and
black lines are associated with the best fitting linear relations
for the members of the three clusters and the total
sample respectively, and are shown only for reference).
In this regard, we conclude that, despite the fact that
the CSC+ sample used for this experiment is highly inhomogeneous,
CLaSPS selects a subset of clusters whose
members show a significant degree of correlation between
the optical monochromatic luminosity $L_{opt}$(2500 Å) and
the $\alpha_{OX}$ spectral index. The behavior of this subset of
sources is in agreement with the results found and discussed
in the literature for homogeneously drawn samples of
X-ray emitting radio-quiet optically selected quasars.

A similar correlation pattern, weaker than the one observed
for the $\alpha_{OX}$ index though, is observed in the distribution
of score values for the hardness ratio HR(hm)
used as label (see central plot of the upper row in Figure 6).

Table 3. List of the experiments performed on the CSC+ sample. A short description of the dataset, the number of sources of the dataset, the total
number of clusters for the clusterings produced and the list of features and labels used are provided for each experiment. In the Labels
column, each label is followed, in curly brackets, by the binning used to evaluate the scores. For categorical labels, the binning is specified
by providing the actual values corresponding to the distinct classes; for continuous labels, the extremes of the bins defining the classes are
provided.

Experiment  Dataset               # sources  # clusters  Features                 Labels
Exp. 1      SDSS quasars with     112        {3, 4, 5}   fuv-nuv, nuv-u, u-g,     z_spec{0.4, 1.1, 1.9},
            CSC detections                               g-r, r-i, i-z, z-Y,      HR(ms){-0.4, -0.2, 0},
                                                         Y-J, J-H, H-K            HR(hm){-0.1, 0, 0.2},
                                                                                  L(B){2, 4, 6, 8} x 10^[...] erg s^-1,
                                                                                  alpha_OX{1.3}
Exp. 2      SDSS quasars with     112        {3, 4, 5}   fuv-nuv, nuv-u, u-g,     z_spec{0.4, 1.1, 1.9},
            CSC detections                               g-r, r-i, i-z, z-Y,      alpha_OX{1.3},
                                                         Y-J, J-H, H-K,           L(S){2, 4, 6, 8} x 10^[...] erg s^-1
                                                         HR(ms), HR(hm)
Exp. 3      SDSS quasars with     192        {3, 4, 5}   fuv-nuv, nuv-u, u-g,     z_spec{0.4, 1.1, 1.9},
            CSC detections and                           g-r, r-i, i-z, z-Y,      alpha_OX{1.3}, f_xdet{0, 1},
            upper limits                                 Y-J, J-H, H-K            L(S){2, 4, 6, 8} x 10^[...] erg s^-1

In the second experiment (Exp. 2 in Table 3), the hardness
ratios HR(hm) and HR(ms) have been used as features,
together with the other variables used as features
in the first experiment. Correlations similar to those observed
in the first experiment are observed in the
clusterings with 4 and 5 total clusters each in the second
experiment (see central and left plots of the central row
in Figure 6).

The PCA of the feature distribution of the clusters with
larger values of the scores showed that the correlation can
be mostly attributed to a subset of features including the
nuv - u and u - g colors and both the X-ray hardness
ratios.

In the third experiment (Exp. 3 in Table 3), we have
used as label the values of upper limits for the X-ray luminosity
and considered them as detections in order to test
whether the distribution of non-X-ray colors of the sample
correlates with either the detections or the upper limits
observed in X-rays. As shown in the plots of the lower
row in Figure 6, no interesting correlations among the set
of features and the labels considered are visible. In particular,
the X-ray detection flag is not correlated with
the near-infrared, optical and ultraviolet colors used as
features, regardless of the clustering methods and total
number of clusters of the clusterings.

4. APPLICATION TO THE BLAZARS DATASET 

CLaSPS has also been applied to a sample of blazars
with available photometric data spanning from the mid-infrared
to optical wavelengths, with additional information
in the $\gamma$-ray spectral range. Blazars are a peculiar
family of AGNs characterized by rapid variability at
all frequencies. The other distinguishing observational
properties of blazars include flat radio spectra, high observed
luminosity and highly variable radio to optical polarization.
Blazars are a dominant class of extragalactic
sources at radio, microwave and $\gamma$-ray frequencies. The
observational characterization of this class of galaxies is
interesting as a tool to shed some light on the physical
mechanisms responsible for the emission. The experiments
described have aimed at the characterization of the
blazars population in the infrared bands, extending the
type of analysis already performed on 2MASS (Skrutskie
et al. 2006) data (e.g. Chen et al. 2005) to longer wavelengths,
using the recently released WISE (Wright et al.
2010) mid-infrared photometric data.



4.1. The Blazars dataset

The Blazars sample is based on the ROMA-BZCAT
(Massaro et al. 2009) list of blazars. This catalog
assembles blazars known in the literature and confirmed
by the inspection of their multi-wavelength emission.
The members of the ROMA-BZCAT catalog are
selected on the basis of a set of criteria involving the
presence of detection in the radio band down to 1 mJy
flux density at 1.4 GHz (21 cm), the optical identification
and availability of an optical spectrum for further
spectral classification and the detection of isotropic X-ray
luminosity $L_X > 10^{43}$ erg s$^{-1}$. Such criteria do
not yield a statistically homogeneous or complete sample
of blazars but provide the largest and most carefully
selected sample of confirmed blazars available to date.
In the ROMA-BZCAT, blazars are also divided in three
spectral classes, based on the prominence of the emission
features in the optical spectra of these sources: BZB for
the BL Lac sources, i.e. AGNs with featureless optical
spectra and narrow emission lines; BZQ for flat-spectrum
radio quasars with optical spectra showing broad emission
lines and typical blazars behavior; BZU for blazars
of uncertain type, associated with sources with peculiar
characteristics but also showing typical traits of the
blazars (a more detailed description of the Blazars
sample can be found in Massaro et al. 2011; D'Abrusco
et al. 2012). This sample includes ~800 $\gamma$-ray sources
from 2FGL (Nolan et al. 2012) associated with members
of the catalog to a high level of confidence, and 571
of these blazars are also present in the ROMA-BZCAT.
More details on the specific data used as features and labels
of the Blazars sample can be found below (for SDSS
data, see Sec. 3.1).

2MASS (Skrutskie et al. 2006) has uniformly scanned
the whole near-infrared sky in three bands H, J and
Ks, detecting point sources brighter than ~1 mJy in
each filter, with positional accuracy of 0.4", to a
magnitude limit (for stellar sources in unconfused regions
and outside of the galactic plane) of 15.8 in the J band.

The WISE mission (Wright et al. 2010) has observed 
the entire sky in the mid-intrared spectral interval at 3.4, 
Figure 6. Results of the application of the CLaSPS method to the datasets based on the CSC+ sample (for details see Sec. 3). In the
upper row, plots from Exp. 1 are shown. From left to right: total scores distribution for the clusterings obtained with the K-means
method; cluster scores distribution for the label X-ray hardness ratio HR(hm); cluster scores distribution for the αox index. The right
plot, in particular, has been discussed extensively in Sec. 3.3. In the central row, plots from the results of Exp. 2 are shown. From left
to right: total scores distribution for the clusterings obtained with the K-means method; cluster score distribution for the total X-ray
luminosity used as label; cluster scores distribution for the αox index. In the lower row, the plots from Exp. 3 are shown. From left to
right: total score distributions of the clusterings obtained using the SOM method; cluster scores distribution for the label represented by
the X-ray detection flag; cluster scores distribution for the αox index (for more details, see Sec. 3). The discussion in Sec. 3.3 and Figure 7
are based on the clustering with five total clusters of the second experiment (mid-left plot in this figure), using as label the αox spectral
index.

4.6, 12, and 22 μm with an angular resolution of 6.1",
6.4", 6.5" and 12.0" in the four bands, achieving 5σ point
source sensitivities of 0.08, 0.11, 1 and 6 mJy in unconfused
regions on the ecliptic, respectively. The astrometric
accuracy of WISE is ~ 0.50", 0.26", 0.26", and 1.4"
for the four WISE bands respectively.



The 2FGL catalog (Nolan et al. 2012) contains primarily
unresolved sources detected in the all-sky Fermi
observations obtained throughout the second year of operation.
The sources, after detection and localization
in the sky, are assigned, among other parameters,
an integrated flux in the 100 MeV to 100 GeV energy
range, a spectral shape and a significance parameter TS
based on how significantly each source emerges from the
background. Only sources with TS > 25, corresponding
to a significance of 4σ, have been included in the
catalog. Each of the 1873 2FGL sources has been considered
for identification with already known astronomical
sources with multi-wavelength observations available
in the literature. For 127 of the 2FGL sources firm
identifications have been produced (namely, reliable identifications
based on synchronous periodic variability of the sources,
coincident spatial morphologies for extended sources or
correlated aperiodic variability). The remaining sources
have been investigated for association with sources contained
in a list of source catalogs based on different
multi-wavelength observations. The BZCAT (Massaro et al.
2009) catalog is one of the catalogs used for the association
of the 2FGL sources, and 571 2FGL sources have
been associated with a BZCAT blazar. The γ-ray detection
flags used in the experiments described in this paper
are based on the official associations of the 2FGL sources
from Nolan et al. (2012).

[Figure 7 plot. Panel title: dataset CSC+, Exp. 1, label αox. Horizontal axis: log[Lopt(2500 Å)] [erg s^-1 Hz^-1]. Inset: histograms of the nuv - u color.]

Figure 7. Distribution of the sources in the first experiment
(Exp. 1) with the CSC+ sample in the Lopt(2500 Å) vs αox plane.
The shaded regions containing the red symbols correspond to the
projections of the three clusters (clusters one, three and five shown
in the upper-right plot in Figure 6) with large scores values for
the label αox in the first experiment (Exp. 1 in Table 3). The
two black polygons represent the projections of the remaining two
clusters (clusters two and four in the mid-left heatmap in Figure 6)
with small scores values. The size of the symbols used for the plots
is proportional to the near-ultraviolet color nuv - u of each source,
and in the inset of the plot the histograms of the nuv - u color
distribution for sources belonging to the three interesting clusters
and the two unselected clusters are shown. The red and black lines
represent the linear regression for the points in the correlated clusters
and the whole sample, respectively. The red line is in perfect
agreement with the best-fit relation from Lusso et al. 2010 (green
line), derived from a sample of 545 AGNs.



4.2. Blazars dataset: features and labels 

In the first experiment, we have used as features
the colors calculated with consecutive filters from the
mid-infrared (WISE) to the optical (SDSS). As labels, we have
used the spectroscopic classification in BZB, BZQ and
BZU from the ROMA-BZCAT catalog (Massaro et al.
2009), the radio flux density at the frequency ν = 1.4
GHz and a γ-ray detection flag. This flag is equal to 1
for the ROMA-BZCAT sources that have been associated
with a γ-ray source from the 2FGL catalog (Nolan et
al. 2012), and 0 for all the other sources.
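
As a rough illustration of how features and labels of this kind could be assembled, the sketch below computes colors as magnitude differences between consecutive filters and sets a γ-ray detection flag. The filter ordering, the key names (`w1` to `w4` for the four WISE bands), the function names and the toy magnitudes are assumptions made for this example; it is not the code used for the experiments.

```python
# Filters ordered from the optical (SDSS) to the mid-infrared (WISE).
# The ordering and the key names are assumptions for illustration only.
FILTERS = ["u", "g", "r", "i", "z", "J", "H", "K", "w1", "w2", "w3", "w4"]


def consecutive_colors(mags, filters=FILTERS):
    """Return the color m[a] - m[b] for each pair of adjacent filters."""
    return {
        f"{a}-{b}": mags[a] - mags[b]
        for a, b in zip(filters, filters[1:])
    }


def gamma_ray_flag(source_name, associated_names):
    """Return 1 if the source has a gamma-ray association, 0 otherwise."""
    return 1 if source_name in associated_names else 0


# Toy magnitudes (invented values, not real photometry).
mags = {"u": 19.5, "g": 19.0, "r": 18.6, "i": 18.3, "z": 18.1,
        "J": 16.9, "H": 16.2, "K": 15.4,
        "w1": 13.0, "w2": 12.0, "w3": 9.5, "w4": 7.2}
colors = consecutive_colors(mags)
flag = gamma_ray_flag("toy_blazar_1", {"toy_blazar_1", "toy_blazar_7"})
```

Twelve filters yield eleven colors, which matches the number of features listed for Exp. 1 in Table 4.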

In the second experiment, the labels used for the first
experiment (Exp. 1 in Table 4) have been complemented
by WISE colors, not used as features in this case, while
the Fermi detection flags have not been used as label because
this sample is composed of all the blazars of our
sample associated with Fermi detections. In the third
experiment, only the optical and near-infrared colors from
SDSS and UKIDSS respectively have been used as features,
while the WISE mid-infrared colors and the Fermi
detection flag have been added to the set of labels already
used in the first experiment for all blazars, regardless of
their association with Fermi detections. The parameters
of the experiments are shown in detail in Table 4.

4.3. Results of the application of CLaSPS to Blazars dataset

The results of the application of CLaSPS to the three
experiments based on the Blazars sample and described
in Table 4 are shown in the plots in Figure 8.

The main conclusion that can be drawn from the results
of the first experiment (plots in the upper row
in Figure 8) is that the distribution of blazars in the
optical+near-infrared+mid-infrared colors feature space,
consistently throughout the distinct clustering methods,
strongly correlates with the spectral classification of the
blazars. The correlation is noticeable, based on the large
values of the total normal and weighted scores evaluated
using the blazars spectral class as label for all clusterings
(upper-mid plot in Figure 8). In particular, the largest
total normal and weighted score values for this label are
both obtained for the clustering with three total clusters
produced with the K-means algorithm (upper-mid
plot in Figure 8).
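
The CLaSPS score itself is defined in Sec. 2.3 and is not reproduced here. As a stand-in, the sketch below builds the contingency table between cluster membership and label class and summarizes it with Cramér's V, a standard association measure that is explicitly not the CLaSPS score. The function names and the toy memberships are invented for this example.

```python
from collections import Counter
from math import sqrt


def contingency(clusters, labels):
    """Count sources for every (cluster, label class) pair."""
    return Counter(zip(clusters, labels))


def cramers_v(clusters, labels):
    """Cramér's V in [0, 1]; a stand-in measure, NOT the CLaSPS score."""
    n = len(clusters)
    table = contingency(clusters, labels)
    row = Counter(clusters)   # number of sources per cluster
    col = Counter(labels)     # number of sources per label class
    chi2 = 0.0
    for c in row:
        for l in col:
            expected = row[c] * col[l] / n
            chi2 += (table[(c, l)] - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Toy case 1: cluster membership tracks the spectral class exactly.
clusters = [0, 0, 0, 1, 1, 1, 2, 2]
classes = ["BZB", "BZB", "BZB", "BZQ", "BZQ", "BZQ", "BZU", "BZU"]
v_perfect = cramers_v(clusters, classes)

# Toy case 2: membership is unrelated to the spectral class.
v_none = cramers_v([0, 1, 0, 1], ["BZB", "BZB", "BZQ", "BZQ"])
```

In this toy setting the first case gives the maximum value and the second gives zero, which mirrors the qualitative behavior expected of any measure of the agreement between clusters and label classes.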

In order to verify whether there is a smaller subset of
features responsible for the correlation, a PCA was carried
out on the feature distribution of the two clusters
with more than one member of the clustering with three
total clusters produced by the K-means with the (BZB,
BZQ, BZU) spectral classes used as labels. This analysis
has shown that the spectral classification of the sources
of the Blazars sample correlates very strongly with the
mid-infrared colors from WISE. The projection of the
feature space distribution of the sources in the Blazars
sample onto the WISE [4.6] - [12] vs [3.4] - [4.6] color-color
space is shown in the left plot in Figure 9. In this
plot, the sources are plotted with different symbols according
to the value of the label, the spectral class (BZB,
BZQ or BZU), and the three regions occupied by the projections
of the three clusters of the clustering with three
total clusters are represented as shaded colored areas.
This finding has been discussed in detail in Massaro et
al. (2011), where an explanation of the new correlation
has been proposed, in terms of the currently accepted
emission mechanisms of blazars.
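
To illustrate the kind of check described above, the sketch below extracts the first principal component of a small feature matrix by power iteration and reports which feature carries the largest loading. It is a simplified pure-Python stand-in for a full PCA, without any rescaling of the features; the function names, the selection of three colors and the toy values are assumptions made for this example.

```python
def covariance(rows):
    """Sample covariance matrix of a list of equal-length feature rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]


def first_component(rows, n_iter=200):
    """Leading eigenvector of the covariance matrix via power iteration."""
    cov = covariance(rows)
    d = len(cov)
    v = [1.0 / d ** 0.5] * d          # unit-length starting vector
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Toy colors (invented values): the optical color barely varies,
# while the two mid-infrared colors vary together.
names = ["g-r", "[3.4]-[4.6]", "[4.6]-[12]"]
rows = [
    [0.50, 0.6, 1.8],
    [0.52, 0.8, 2.2],
    [0.49, 1.0, 2.6],
    [0.51, 1.2, 3.0],
    [0.50, 1.4, 3.4],
]
pc1 = first_component(rows)
dominant = names[max(range(len(names)), key=lambda j: abs(pc1[j]))]
```

With these toy values the first component is dominated by the two mid-infrared colors, the situation in which a small subset of features drives the separation between clusters.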

The same correlation between the distribution of features
of the clusterings and the spectral classification has
been observed in the second experiment. In this case,
the dataset used contains only blazars from the ROMA-BZCAT
catalog which have been associated with γ-ray
sources in the 2FGL (Nolan et al. 2012). Similarly, in the
second experiment (Exp. 2 in Table 4) the correlation
can be almost entirely ascribed to
Table 4

Characteristics of the experiments performed on the Blazars sample. For a detailed description of the columns, refer to the caption of
Table 3.

Exp. 1
  Dataset: BZCAT blazars with FIR, NIR and Optical photometry
  # sources: 241
  Clusters: {3, 4, 5, 6}
  Features: u-g, g-r, r-i, i-z, z-J, J-H, H-K, K-[3.4], [3.4]-[4.6], [4.6]-[12], [12]-[22]
  Labels: source class {BZB, BZQ, BZU}, f(1.4 GHz) {10^?, 3x10^?}

Exp. 2
  Dataset: γ-ray detected BZCAT blazars with FIR, NIR and Optical photometry
  # sources: 241
  Clusters: {3, 4, 5, 6}
  Features: u-g, g-r, r-i, i-z, z-J, J-H, H-K, K-[3.4], [3.4]-[4.6],
  Labels: source class {BZB, BZQ, BZU}, f(1.4 GHz) {10^?, 3x10^?},

Exp. 3
  Dataset: BZCAT blazars with NIR and Optical photometry
  # sources: 241
  Clusters: {3, 4, 5, 6}
  Features: u-g, g-r, r-i, i-z, z-J, J-H, H-K
  Labels: source class {BZB, BZQ, BZU}, f(1.4 GHz) {10^?, 3x10^3}, fγdet {0, 1}, [3.4]-[4.6] {0, 0.5, 1, 1.5, 2}, [4.6]-[12] {0, 1, 2, 3, 4, 5}, [12]-[22] {0, 1, 2, 3, 4}
Figure 8. Results of the application of the CLaSPS method to the datasets based on the Blazars sample (for details see Sec. 4). In the
upper row, plots from Exp. 1 are shown. From left to right: total scores distributions for clusterings obtained with the K-means method;
cluster scores distributions for the spectral class of the blazars used as label; cluster scores distributions for the γ-ray detection flag used
as label. In the lower row, plots from the results of Exp. 3 are shown. From left to right: total score distributions for clusterings created by
the K-means method; cluster scores distribution for the [3.4] - [4.6] WISE color used as label; cluster scores distribution for the [4.6] - [12]
WISE color used as label.
the peculiar WISE colors distribution of the blazars.

An even stronger correlation has been observed in the
third experiment. The third experiment has involved the
same dataset used in the first experiment, comprising
blazars from the BZCat, with optical and near-infrared
colors as features and the WISE infrared colors [3.4] - [4.6],
[4.6] - [12] and [12] - [22] as labels instead. The three
WISE colors, used as labels, have been binned as shown
in Table 4.

The lower central and left plots in Figure 8 show the
scores values distributions for the clusterings produced
by the K-means method on the third experiment (Exp. 3
in Table 4) calculated using as labels the two colors
[3.4] - [4.6] and [4.6] - [12] from WISE. The large values of
the scores for all clusterings indicate that a strong correlation
exists between the distribution of sources in the
clusterings and their mid-IR colors. Based on the binning
used in this experiment for these two labels (see Table 4),
this result suggests that the distribution of blazars in the
mid-infrared colors is peculiar, as most blazars are contained
in a narrow region of the mid-IR WISE colors
space.
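
The binning of a continuous label into classes can be sketched as follows. The values listed for the two WISE colors of Exp. 3 in Table 4 are interpreted here as bin edges, which is an assumption about how they are used; the function name and the toy colors are also invented for this example.

```python
from bisect import bisect_right
from collections import Counter

# Values listed for Exp. 3 in Table 4, interpreted as class boundaries
# (an assumption about how the listed values are used).
EDGES_W1_W2 = [0, 0.5, 1, 1.5, 2]      # [3.4]-[4.6]
EDGES_W2_W3 = [0, 1, 2, 3, 4, 5]       # [4.6]-[12]


def label_class(value, edges):
    """Index of the bin containing value; 0 means below the first edge."""
    return bisect_right(edges, value)


# Toy colors (invented values) concentrated in a narrow range.
w1_w2 = [0.7, 0.9, 1.1, 0.8, 1.0, 0.95]
classes = [label_class(c, EDGES_W1_W2) for c in w1_w2]
occupancy = Counter(classes)
```

When most sources fall into one or two adjacent bins, as in this toy case, almost any clustering will appear strongly associated with the binned label, which is why the choice of the bin edges matters for the interpretation of the scores.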

This fact is evident in the right plot of Figure 9, where
the projection onto the WISE [3.4] - [4.6] vs [4.6] - [12]
color-color plane of the distribution of the blazars in the
feature space of the third experiment (Exp. 3 in Table 4)
is shown. The regions occupied by the projections of
the clusters of the clustering with five total clusters produced
by the K-means algorithm (see lower central and
left plots in Figure 8) are represented by shaded colored
areas. The symbols of the points, as in the left plot in
Figure 9, reflect the spectral classification. The horizontal
and vertical black lines correspond to the limits
of the bins used for the [3.4] - [4.6] and [4.6] - [12] WISE
colors used as labels (see Table 4 for more details about
the experiment). The clusters in this plot are associated
with the values of the total and cluster scores shown
in the column corresponding to the clustering with five
total clusters in both the lower central and left plots
in Figure 8 for the two WISE colors respectively. The
red, green and magenta large regions correspond to the
second, third and fifth clusters of the clustering respectively,
while the small five-member group corresponds
to the first cluster and the single-source cluster is the
fourth cluster in the two plots of Figure 8. Multiple
generic sources from the WISE catalog, drawn from a
region of the sky at high galactic latitude, are plotted as
small gray points to show that the locus occupied by the
blazars is clearly separated from the high-density regions
of the overall distribution of WISE sources in the same
color-color plane.

In this case, the application of CLaSPS has helped to
determine a previously unknown correlation between a
class of astronomical sources and a small number of features,
as the clusters of all clusterings follow a narrow
locus in the WISE [3.4] - [4.6] vs [4.6] - [12] color-color
plane, occupied by the whole sample of blazars. While
this pattern is also clearly visible in the low-dimensional
(two- or three-dimensional) distribution of the blazars WISE
colors, it has been discovered during the exploration of
the multi-dimensional feature space generated by the
multi-wavelength colors of BZCat blazars. This is an
example of a low-dimensional pattern that had so far gone
unnoticed and that a general method for the determination
of correlations in complex feature spaces, like
CLaSPS, has helped to single out and characterize further.

A thorough investigation of the spectral mid-infrared
and γ-ray properties of this sample of blazars that
result in this peculiar pattern has been presented
in D'Abrusco et al. (2012). Some of the authors of
this paper have also developed a method, based on
the mid-infrared properties of BZCat blazars discussed
above (Massaro et al. 2012), for the selection of blazar
candidates from WISE photometric data.

5. SUMMARY AND FUTURE DEVELOPMENTS 

In this paper we have presented CLaSPS, a new
method for the determination of correlations in complex
astronomical datasets, based on KD techniques for unsupervised
clustering supplemented by the use of external
information to label and characterize the content of the
clusters (Sec. 2). We have introduced the score (Sec. 2.3)
and shown the reliability of the score as a measure of the
degree of correlation between the membership distribution
of sources in a clustering and the distribution of a
quantitative or categorical label in distinct classes, using
simulated clusterings (Sec. 2.4).

We have also discussed the applications of CLaSPS to
two different samples composed of extragalactic sources
with multi-wavelength photometry used as features: the
first dataset, CSC+ (Sec. 3), is composed of spectroscopically
confirmed quasars from the SDSS DR8 with
multi-wavelength observations in the near-infrared, optical
and ultraviolet, and detected (or with reliable upper
limits) in the Chandra X-ray CSC catalog; the second
dataset (Sec. 4) is composed of optically confirmed
blazars with mid-infrared, near-infrared and optical observations,
complemented, for a subset of the sources, by
γ-ray data from the 2FGL.

The main result of the application of CLaSPS to
the CSC+ dataset has been the confirmation of a
well-known correlation (see, for example, Lusso et al. 2010)
between the near-ultraviolet/blue optical luminosity of
optically selected radio-quiet quasars and the spectral
index αox (Sec. 3.3) in a subset of the highly inhomogeneous
CSC+ sample. CLaSPS has narrowed the CSC+
sample to three specific clusters that show significant correlation
between the Lopt(2500 Å) monochromatic luminosity
and the αox spectral index, based on the clustering
of the CSC+ sample in the feature space generated
by the near-infrared, optical and ultraviolet photometric
data. Further analysis of the results has shown that
the correlation for the subset of sources contained in the
correlated clusters is driven by the values of the nuv - u
color, as an indicator of the presence of the "big-blue-bump"
component in the SEDs of the sources.

In the case of the experiments performed on the blazars
sample, CLaSPS has revealed a previously unknown correlation
between the spectral classification of the blazars in BZQs,
BZBs and BZUs (Sec. 4.1) and their distribution in the
feature space generated by mid-infrared, near-infrared
and optical colors. Further investigation has shown that
the correlation is almost entirely attributable to the peculiar
pattern that BZCat and γ-ray detected
blazars follow in the WISE mid-infrared color space
(Sec. 4.3). The implications of this pattern on the modeling
of blazars emission mechanism and a novel method
Figure 9. Left plot: projection of the distribution in the multi-dimensional feature space of the Blazars sample used in the first experiment
(Exp. 1 in Table 4) onto the [4.6] - [12] vs [3.4] - [4.6] μm WISE color-color plane. Blazars of different spectral classes (BZB, BZQ and
BZU) according to the BZCat are plotted with different symbols. The shaded regions correspond to the projections of the three clusters of
the clustering with three total clusters obtained with the K-means algorithm using the spectral class as label (upper-mid plot of Figure 8).
The interpretation of the clustering is discussed in Sec. 4.3. Right plot: plot of the distribution of the Blazars sample used in experiment
three (Exp. 3) in the WISE color-color plane generated by the colors [4.6] - [12] and [3.4] - [4.6], used as labels. The horizontal and vertical
black lines represent the edges of the label bins used in experiment three for the [3.4] - [4.6] and [4.6] - [12] WISE colors respectively (the
distributions of scores associated with these two labels are shown in the lower central and left plots in Figure 8). The background grey dots
are shown for reference and correspond to 453420 WISE generic sources detected at high Galactic latitude. This plot has been adapted
from a similar plot in D'Abrusco et al. (2012).



for the selection of candidate blazars from mid-infrared
survey photometric data based on such pattern have been
investigated in other works by some of the authors (Massaro
et al. 2011; D'Abrusco et al. 2012; Massaro et al. 2012).

While in this paper we have described applications of
CLaSPS to inhomogeneous samples obtained by federating
data from general purpose large area surveys, we
plan to apply the method to large homogeneous samples
of extragalactic sources, like the Chandra-COSMOS
dataset (Elvis et al. 2009; Civano et al. 2012).

CLaSPS selects the optimal clustering based on the
scores, a measure of the correlation between the clustering
membership and a given partition of one external
observable used as label. For this reason, as discussed
in Sec. 2.1, CLaSPS differentiates itself from "cluster
ensembles" techniques. Nonetheless, three different aspects
of the current CLaSPS method could be improved by
the application of cluster ensembles techniques: 1) the
limited number of clustering techniques used may bias
the exploration of the clusterings towards particular aspects
of the feature distribution of the dataset considered.
Moreover, CLaSPS does not take into account
the properties and, potentially, weaknesses of each distinct
clustering technique; 2) the choice of the optimal
clustering is based on a single label at a time. Correlations
between a given set of clusterings and multiple
labels cannot be captured by CLaSPS, but are left to the
interpretation of multiple distinct label experiments; 3)
the choice of the optimal clusterings in CLaSPS is based
on a single "view" of the dataset, i.e. on clusterings obtained
using a single set of sources and/or features.

The first point could be easily addressed by widen- 
ing the portfolio of clustering methods used by CLaSPS. 
Then, cluster ensembles methods could be applied to sub- 
sets of clusterings (grouped by total number of clusters or 
by type of clustering method) to determine the "consen- 
sus clustering" of each subset of clusterings. The scores 
would then be evaluated on the set of consensus cluster- 
ings determined in this way. The second point could be 
similarly addressed by searching for the "consensus clus- 
terings" of the set of optimal clusterings selected through 
the scores values for different labels. 
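
One simple way to obtain a "consensus clustering" from a group of clusterings is sketched below: a co-association matrix records how often each pair of sources shares a cluster, and sources linked above a threshold are merged. This is only one of several possible consensus functions and is not claimed to be the one that would be adopted in CLaSPS; the function names, the threshold of 0.5 and the toy clusterings are assumptions made for this example.

```python
def co_association(clusterings):
    """Fraction of clusterings in which each pair of sources shares a cluster."""
    n = len(clusterings[0])
    m = len(clusterings)
    return [[sum(c[i] == c[j] for c in clusterings) / m for j in range(n)]
            for i in range(n)]


def consensus(clusterings, threshold=0.5):
    """Merge sources linked by a co-association above the threshold."""
    co = co_association(clusterings)
    n = len(co)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Flood-fill the connected component that contains `start`.
        stack = [start]
        labels[start] = current
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Three toy clusterings of six sources; cluster ids are arbitrary per run.
runs = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 2, 2],
]
consensus_labels = consensus(runs)
```

The comparison is made on pairs of sources rather than on cluster ids, so the result does not depend on how each clustering method numbers its clusters.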

The third point is particularly important for astronomy,
because in most astronomical datasets a different
number of features is available for different members of
the dataset. In its current implementation, CLaSPS can
be applied only to clusterings obtained with a fixed given
subset of sources and features. CLaSPS, in this scenario,
can be applied separately to distinct groups of sources
in the dataset with a set of common features. In order
to overcome this limitation, distinct sets of clusterings
could be obtained for different "views" of the dataset, i.e.
different subsets of the dataset with the same set of features
available. Then, the multiple clusterings obtained
on the different views of the dataset with different
clustering techniques could be consolidated into a single
set of clusterings through the application of cluster ensembles
techniques to the groups of clusterings obtained
with the same clustering technique on distinct views of
the dataset. This approach is similar to the "features
distributed clustering" and "object distributed clustering"
scenarios typical of practical applications of clustering
ensembles (Strehl & Ghosh 2003).
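
The notion of "views" can be made concrete with a short sketch that groups sources according to the set of features actually available for each of them. The data layout (a dictionary per source, with `None` marking a missing measurement), the function name and the toy values are assumptions made for this example.

```python
from collections import defaultdict


def split_into_views(sources):
    """Group source names by the set of features they actually have."""
    views = defaultdict(list)
    for name, features in sources.items():
        available = frozenset(k for k, v in features.items() if v is not None)
        views[available].append(name)
    return dict(views)


# Toy catalog (invented values); None marks a missing measurement.
sources = {
    "src_a": {"u-g": 0.4, "J-H": 0.7, "[3.4]-[4.6]": 1.0},
    "src_b": {"u-g": 0.5, "J-H": None, "[3.4]-[4.6]": 0.9},
    "src_c": {"u-g": 0.3, "J-H": 0.6, "[3.4]-[4.6]": 1.1},
    "src_d": {"u-g": 0.6, "J-H": None, "[3.4]-[4.6]": 0.8},
}
views = split_into_views(sources)
```

Each group can then be clustered independently in its own feature space before the resulting clusterings are consolidated.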

A further improvement to the CLaSPS method is re- 
lated to the choice of the classes of the labels. In the fre- 
quent case of quantitative continuous labels, the choice 
of the binning is crucial for the evaluation of the scores 
and, in turn, for the determination of the correlations 
among features and labels, if any. Letting the astronomer
choose the binning of the labels on the basis of a priori
knowledge of the specific topic is a viable
option in most cases, for example when the aim is to
generalize an already known correlation or to investigate
a generic problem (e.g. the characterization of astronomical
sources based on their photometric parameters, as in
this paper). It can, however, limit the generality of the
method when the aim of the experiments is a "blind"
exploration of multi-dimensional astronomical datasets.
In order to improve this aspect of the CLaSPS method, 
we are exploring the possibility of complementing the 
astronomer's definition of classes of labels with sponta- 
neous classes that can be determined from the intrinsic 
distribution of the labels themselves by the application 
of non-parametric KD techniques. 
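
As a toy illustration of classes derived from the distribution of the label itself, the sketch below places the class boundaries in the widest gaps between sorted label values. This gap-based heuristic is only a simple stand-in and is not one of the non-parametric KD techniques referred to above; the function name and the toy values are invented for this example.

```python
def gap_edges(values, n_classes):
    """Place n_classes - 1 edges in the widest gaps between sorted values."""
    ordered = sorted(values)
    if n_classes < 2 or len(ordered) < n_classes:
        raise ValueError("need n_classes >= 2 and at least n_classes values")
    # Each entry is (gap width, midpoint of the gap).
    gaps = sorted(
        ((b - a, (a + b) / 2) for a, b in zip(ordered, ordered[1:])),
        reverse=True,
    )
    return sorted(mid for _, mid in gaps[:n_classes - 1])


# Toy label values (invented) forming three well-separated groups.
label_values = [0.1, 0.2, 0.3, 1.0, 1.1, 2.5, 2.6]
edges = gap_edges(label_values, 3)
```

For these toy values the two edges fall between the three groups, so the resulting classes follow the structure of the label distribution rather than a grid chosen in advance.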